Student Information

Name:吳霽函

Student ID:109065523

GitHub ID:crescendoCat


Instructions

  1. First: do the take home exercises in the DM2020-Lab1-Master Repo. You may need to copy some cells from the Lab notebook to this notebook. This part is worth 20% of your grade.
  1. Second: follow the same process from the DM2020-Lab1-Master Repo on the new dataset. You don't need to explain all details as we did (some minimal comments explaining your code are useful though). This part is worth 30% of your grade.
    • Download the new dataset. The dataset contains sentences and score labels. Read the specifications of the dataset for details.
    • You are allowed to use and modify the helper functions in the folder of the first lab session (notice they may need modification) or create your own.
  1. Third: please attempt the following tasks on the new dataset. This part is worth 30% of your grade.
    • Generate meaningful new data visualizations. Refer to online resources and the Data Mining textbook for inspiration and ideas.
    • Generate TF-IDF features from the tokens of each text. This will generate a document matrix; however, the weights will be computed differently (using the TF-IDF value of each word per document as opposed to the word frequency). Refer to this Scikit-learn guide.
    • Implement a simple Naive Bayes classifier that automatically classifies the records into their categories. Use both the TF-IDF features and the word frequency features to build two separate classifiers. Comment on the differences. Refer to this article.
  1. Fourth: In the lab, we applied each step really quickly just to illustrate how to work with your dataset. Some things are not ideal or not the most efficient/meaningful, and each dataset can be handled differently as well. What are the inefficient parts you noticed? How can you improve the data preprocessing for these specific datasets? This part is worth 10% of your grade.
  1. Fifth: It's hard for us to follow if your code is messy :'(, so please tidy up your notebook and add minimal comments where needed. This part is worth 10% of your grade.

You can submit your homework following these guidelines: Git Intro & How to hand in your homework. Make sure to commit and save your changes to your repository BEFORE the deadline (Oct. 22nd 11:59 pm, Thursday).

1. Take Home Exercises

Data Preparation

Preparing the data for Take Home Exercises.

>>> Exercise 2 (take home):

Experiment with other querying techniques using pandas dataframes. Refer to their documentation for more information.

My Answer

In the following cells, I try to count the rows whose text length is over 700, and draw a bar plot to visualize this.

Also we can set different threshold on text length to see a rough distribution of text length in the data set.
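A minimal sketch of that threshold count, using a hypothetical toy frame in place of the lab's `X` dataframe (column name `text` assumed from the lab):

```python
import pandas as pd

# A toy frame standing in for the lab's X dataframe (hypothetical data).
X = pd.DataFrame({"text": ["short", "x" * 800, "medium length text",
                           "y" * 1200, "z" * 350]})

# For each length threshold, count how many texts exceed it.
thresholds = [100, 300, 500, 700, 1000]
counts = [int((X["text"].str.len() > t).sum()) for t in thresholds]

print(counts)  # [3, 3, 2, 2, 1]
```

A bar plot of `counts` against `thresholds` (for instance with matplotlib's `plt.bar`) then gives the rough length distribution described above.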

>>> Exercise 5 (take home)

There is an old saying that goes, "The devil is in the details." When we are working with extremely large data, it's difficult to check records one by one (as we have been doing so far). And also, we don't even know what kind of missing values we are facing. Thus, "debugging" skills get sharper as we spend more time solving bugs. Let's focus on a different method to check for missing values and the kinds of missing values you may encounter. It's not easy to check for missing values as you will find out in a minute.

Please check the data and the process below, describe what you observe and why it happened.
$Hint$: why didn't .isnull() work?

My Answer

The values in rows 2 and 3 are the strings 'NaN' and 'None'. pd.isnull() sees them as ordinary strings, and since they are string types they are not null anymore. Likewise, the value in row 5 is an empty string; it is still a string, so pd.isnull() treats it as a string type instead of Python's None type.
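This behavior is easy to reproduce on a toy Series (hypothetical values, mirroring the kinds of "missing" entries above):

```python
import numpy as np
import pandas as pd

# Indices 2 and 3 hold the *strings* 'NaN' and 'None'; index 5 is an
# empty string. Only a real np.nan or Python None counts as null.
s = pd.Series(["hello", np.nan, "NaN", "None", None, ""])

print(s.isnull().tolist())  # [False, True, False, False, True, False]
```

So string sentinels have to be normalized first (e.g. replaced with `np.nan`) before `.isnull()` can find them.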

>>> Exercise 6 (take home):

Notice any changes to the X dataframe? What are they? Report every change you noticed as compared to the previous state of X. Feel free to query and look more closely at the dataframe for these changes.

My Answer

The size of X_sample is 1000, smaller than the original X. Also, the row order of X_sample is shuffled by the random sampling process.
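A sketch of the sampling step, with a hypothetical frame standing in for `X`:

```python
import pandas as pd

# Hypothetical frame standing in for X.
X = pd.DataFrame({"text": [f"doc {i}" for i in range(5000)]})

# sample() draws rows without replacement in random order;
# the original index labels are kept, which reveals the shuffle.
X_sample = X.sample(n=1000, random_state=42)

print(len(X_sample))  # 1000
```

Because the original index labels are preserved, a quick look at `X_sample.index` is enough to see that the order is no longer monotonic.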

>>> Exercise 8 (take home):

We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise. Below we show you a snapshot of the type of chart we are looking for.

alt txt

>>> Exercise 10 (take home):

We said that the 1 at the beginning of the fifth record represents the 00 term. Notice that there is another 1 in the same record. Can you provide code that verifies which word this 1 represents in the vocabulary? Try to do this as efficiently as possible.

My Answer

Now we know what the 1s in the word vector mean: the first 1 represents the term 00 and the second 1 represents 01.

>>> Exercise 11 (take home):

From the chart above, we can see how sparse the term-document matrix is; i.e., there is only one term with a frequency of 1 in the subselection of the matrix. By the way, you may have noticed that we only selected 20 articles and 20 terms to plot the histogram. As an exercise you can try to modify the code above to plot the entire term-document matrix or just a sample of it. How would you do this efficiently? Remember there are a lot of words in the vocab. Report below what methods you would use to get a nice and useful visualization.

>>> Exercise 12 (take home):

Please try to reduce the dimensions to 3 and plot the result using a 3-D plot. Use at least 3 different angles (camera positions) to check your result and describe what you found.

$Hint$: you can refer to Axes3D in the documentation.

>>> Exercise 13 (take home):

If you want a nicer interactive visualization here, I would encourage you to install and use plotly to achieve this.

>>> Exercise 14 (take home):

The chart above contains all the vocabulary, and it's computationally intensive to both compute and visualize. As an exercise, can you efficiently reduce the number of terms you want to visualize?

>>> Exercise 15 (take home):

Additionally, you can attempt to sort the terms on the x-axis by frequency instead of in alphabetical order. This way the visualization is more meaningful and you will be able to observe the so-called long tail (get familiar with this term, since it will appear a lot in data mining and other statistics courses). See the picture below.

alt txt

>>> Exercise 16 (take home):

Try to generate the binarization using the category_name column instead. Does it work?

My Answer

It worked, but be careful when fitting the binarizer: the fit step should use X.category_name as well, i.e. mlb.fit(X.category_name) rather than mlb.fit(X.category). Otherwise mlb.transform(X['category_name']) will run into problems, because the classes learned during fit don't match the labels being transformed.
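The fit/transform pairing can be sketched like this, with hypothetical category names standing in for `X.category_name`:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical labels standing in for X.category_name.
category_name = [["sci.med"], ["comp.graphics"], ["sci.med"]]

mlb = MultiLabelBinarizer()
# The classes seen during fit() define the output columns, so
# transform() must get the same column it was fitted on --
# labels unseen at fit time are ignored (with a warning).
mlb.fit(category_name)
binarized = mlb.transform(category_name)

print(list(mlb.classes_))  # ['comp.graphics', 'sci.med']
```

Fitting on a different column (e.g. numeric category codes) would make every name in `category_name` unknown at transform time, which is exactly the problem described above.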

2. The New Dataset

Tasks to Do

  1. Data Source
  2. Data Preparation
  3. Data Transformation
    • 3.1 Converting Dictionary into Pandas dataframe
    • 3.2 Familiarizing yourself with the Data
  4. Data Mining using Pandas
    • 4.1 Dealing with Missing Values
    • 4.2 Dealing with Duplicate Data
  5. Data Preprocessing
    • 5.1 Sampling
    • 5.2 Feature Creation
    • 5.3 Feature Subset Selection
    • 5.4 Dimensionality Reduction
    • 5.5 Attribute Transformation / Aggregation
    • 5.6 Discretization and Binarization
  6. Data Exploration
  7. Conclusion
  8. References

1. Data Source

The new dataset is a set of sentences labelled with positive or negative sentiment. I downloaded three files containing the data into the ./data/ folder:

  1. amazon_cells_labelled.txt
  2. imdb_labelled.txt
  3. yelp_labelled.txt

So, let's start to prepare our data.

2. Data Preparation

Load the data using the pd.read_table() function.
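A sketch of the loading step; the real files live in `./data/`, so here a small in-memory sample mimics their tab-separated "sentence&lt;TAB&gt;score" layout:

```python
import io
import pandas as pd

# In-memory stand-in for amazon_cells_labelled.txt and friends,
# which use one "sentence<TAB>score" record per line.
sample = "Great phone.\t1\nBattery died fast.\t0\n"

df = pd.read_table(io.StringIO(sample), header=None,
                   names=["sentence", "score"])

print(df.shape)  # (2, 2)
```

The three real files can be loaded the same way (passing each path instead of the `StringIO` object) and stacked with `pd.concat(..., ignore_index=True)`.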

3. Data Transformation

Since we already used pandas DataFrames to read the data, in this step we just combine the datasets together and skip 3.1 and 3.2.

We get 3,000 records in total above.

4. Data Mining using Pandas

4.1 Dealing with Missing Values

It seems there are no missing values.

4.2 Dealing with Duplicate Data

It seems the duplicates have been dealt with!
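Both checks (4.1 and 4.2) boil down to a couple of pandas one-liners, sketched here on a hypothetical mini-frame standing in for the combined 3,000 sentences:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the combined dataset.
df = pd.DataFrame({"sentence": ["Good.", "Bad.", "Good."],
                   "score": [1, 0, 1]})

# 4.1 -- total count of missing values across all columns.
missing = df.isnull().sum().sum()

# 4.2 -- count, then drop, exact duplicate rows.
n_dups = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)

print(missing, n_dups, len(df))  # 0 1 2
```

`reset_index(drop=True)` keeps the index contiguous after rows are removed, which avoids surprises in later positional indexing.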

5. Data Preprocessing

5.1 Sampling

5.2 Feature Creation

5.3 Feature subset selection

5.4 Dimensionality Reduction

5.5 Attribute Transformation / Aggregation

5.6 Discretization and Binarization

Since the labels in the new dataset are already binarized (positive sentences are labelled 1 and negative sentences are labelled 0), we can just skip this step.

6. Data Exploration

In the cell above, the similarities between sentences 1, 2, and 3 are all 0 since they are completely different sentences, so the outcome is acceptable.

  1. Third: please attempt the following tasks on the new dataset. This part is worth 30% of your grade.
    • Generate meaningful new data visualizations. Refer to online resources and the Data Mining textbook for inspiration and ideas.
    • Generate TF-IDF features from the tokens of each text. This will generate a document matrix; however, the weights will be computed differently (using the TF-IDF value of each word per document as opposed to the word frequency). Refer to this Scikit-learn guide.
    • Implement a simple Naive Bayes classifier that automatically classifies the records into their categories. Use both the TF-IDF features and the word frequency features to build two separate classifiers. Comment on the differences. Refer to this article.

3. Classify the Dataset

  1. Visualizations
  2. TF-IDF Features
  3. Naive Bayes Classification \ 3.1 Using TF-IDF \ 3.2 Using Word Frequency Features \ 3.3 Simple Discussion

1. Visualizations

  1.1 Using a frequency bar plot, but sorting the terms by frequency first

Actually, most of the first 300 items are stop words, like the, it, is, and so on. So I try to plot the frequency plot without the stop words.

The explained_variance_ attribute of pca_3 represents the importance of each component, so we can visualize the importance of the components with a bar plot.
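A quick sketch of reading those importances off a fitted PCA, using synthetic data in place of the document matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data; the homework applies this to the document matrix.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))
data[:, 0] *= 5  # inflate one direction's variance

pca_3 = PCA(n_components=3).fit(data)

# Components are sorted by explained variance, so this ratio
# is non-increasing and sums to at most 1.
print(pca_3.explained_variance_ratio_)
```

`explained_variance_ratio_` (each component's share of total variance) is usually easier to read on a bar plot than the raw `explained_variance_` values.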

2. TF-IDF Features
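The TF-IDF matrix has the same document-term layout as the count matrix; only the weighting differs. A toy sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["good phone", "bad phone", "good battery"]  # toy stand-in

# Same document-term shape as CountVectorizer, but each weight is a
# tf-idf score instead of a raw count.
X_counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(X_counts.shape, X_tfidf.shape)  # (3, 4) (3, 4)
```

Rarer terms get larger idf weights: here "battery" (one document) scores higher idf than "phone" (two documents).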

3. Naive Bayes Classification

3.1 Using TF-IDF

3.2 Using Word Frequency Features

And we get an accuracy table.

3.3 Simple Discussion

First, let me draw some comparison graphs to visualize the outcome.

For the Bernoulli Naive Bayes classifier, the accuracies with TF-IDF (tfidf) and frequency (counts) features are the same, since Bernoulli treats the numbers as binary.
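This can be demonstrated directly: `BernoulliNB` binarizes its input (default `binarize=0.0`), so any positive weight becomes 1 and both feature matrices collapse to the same 0/1 matrix. A toy sketch:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy labelled corpus standing in for the real dataset.
corpus = ["good great phone", "bad awful phone", "good battery", "awful screen"]
y = [1, 0, 1, 0]

X_counts = CountVectorizer().fit_transform(corpus)
X_tfidf = TfidfVectorizer().fit_transform(corpus)

# With binarize=0.0 every positive entry becomes 1, so counts and
# tf-idf yield the exact same binary matrix and identical fits.
pred_counts = BernoulliNB().fit(X_counts, y).predict(X_counts)
pred_tfidf = BernoulliNB().fit(X_tfidf, y).predict(X_tfidf)

same = np.array_equal(pred_counts, pred_tfidf)
print(same)  # True
```

The two vectorizers share the same tokenizer and therefore the same vocabulary, so the binarized matrices match column for column.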

The graph above shows two vectors for the first record of our data -- the first sentence in amazon_cells_labelled.txt -- "So there is no way for me to plug it in here in the US unless I go by a converter." Let's look at the terms and the statistics more closely.

Counts or TF-IDF Vectors

Mapping the terms and numbers together: the vector is very sparse, so every value not shown in the table is zero. A Bernoulli classifier treats every positive value above as 1, so the TF-IDF vector and the counts vector look identical to it. This is why the Bernoulli Naive Bayes classifier has the same accuracy on the two different feature sets. The Multinomial Naive Bayes classifier, on the other hand, models the numbers with a multinomial distribution. Since the treatment is no longer just 0 or 1, I expected the TF-IDF outcome to be better than the simple frequency-based count vectors. But that is not the case: the count vectors actually do slightly better than TF-IDF, although the difference in accuracy is less than 0.01. So I use cross validation to see the average performance on the two feature sets.
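The cross-validation comparison can be sketched as follows (hypothetical mini-corpus in place of the 3,000 labelled sentences):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus; the homework uses the real 3,000 sentences.
corpus = ["good great", "bad awful", "great phone", "awful phone",
          "good battery", "bad screen"] * 5
y = [1, 0, 1, 0, 1, 0] * 5

X_counts = CountVectorizer().fit_transform(corpus)
X_tfidf = TfidfVectorizer().fit_transform(corpus)

# 5-fold CV averages out a lucky or unlucky single split.
acc_counts = cross_val_score(MultinomialNB(), X_counts, y, cv=5).mean()
acc_tfidf = cross_val_score(MultinomialNB(), X_tfidf, y, cv=5).mean()

print(acc_counts, acc_tfidf)
```

Averaging across folds makes the counts-vs-TF-IDF gap more trustworthy than a single train/test split.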

And we also plot a ROC curve.
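A minimal ROC sketch with scikit-learn (toy corpus; the homework computes this on the real data, ideally on a held-out split):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real labelled sentences.
corpus = ["good great", "bad awful", "great phone", "awful phone"] * 5
y = np.array([1, 0, 1, 0] * 5)

X = CountVectorizer().fit_transform(corpus)
probs = MultinomialNB().fit(X, y).predict_proba(X)[:, 1]

# ROC traces true-positive rate vs false-positive rate over thresholds.
fpr, tpr, _ = roc_curve(y, probs)
roc_auc = auc(fpr, tpr)
print(roc_auc)
```

Plotting `fpr` against `tpr` gives the curve itself; `roc_auc` summarizes it in one number.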

From the similarity matrix we can see that X_tfidf and X_counts are very similar, so the predictions of classifiers using the two matrices will also be very similar. TF-IDF features may not be that powerful on this dataset.

Stop Words

For the classifiers using the dataset without stop words, the average performance is lower than for those using the dataset with stop words. This shows that ignoring stop words can sometimes hurt classification.

4. Lab1 Feedback

For me, the most inefficient part must be counting the frequency of terms, as in:

import numpy as np
import progressbar

# Sum each sparse column one at a time (slow: 30k toarray() calls).
term_frequencies = []
for j in progressbar.progressbar(range(0, X_counts.shape[1])):
    term_frequencies.append(sum(np.ravel(X_counts[:, j].toarray())))

I use a progressbar to show visual progress while my CPU grinds through this long computation. The inefficiency may come from the

X_tfidf[:, j].toarray()

term. We have about 30k terms in our dataset, so the loop calls toarray() 30k times! The built-in sum() over each column is also slower than numpy's own vectorized reduction. So in my homework I replaced all the frequency computations with

term_frequencies = np.ravel(np.sum(X_counts, axis=0))

This does exactly the same work as the code above, but much faster.

Let's check whether the outcome is the same.
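The equivalence check can be sketched on a small random sparse matrix standing in for the real `X_counts`:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Small random sparse matrix standing in for the real X_counts.
X_counts = sparse_random(50, 200, density=0.05, format="csr", random_state=0)

# Slow per-column loop from the lab ...
slow = [np.ravel(X_counts[:, j].toarray()).sum()
        for j in range(X_counts.shape[1])]
# ... versus the single vectorized reduction.
fast = np.ravel(X_counts.sum(axis=0))

print(np.allclose(slow, fast))  # True
```

`np.allclose` confirms the two computations agree to floating-point precision; timing them (e.g. with `%timeit`) shows the speed gap.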

Well, the faster one is roughly 100,000 times faster than the slower one.

The other parts, such as checking missing values and duplicated values with pandas, computing bag-of-words features (CountVectorizer), and reducing features with PCA, are all very useful techniques for data preprocessing and mining. Lots of thanks to the TAs!